Overview

OpenCLIP provides extensive configuration options for training CLIP models. This page documents all important training flags and hyperparameters from params.py. To see all available options:
python -m open_clip_train.main --help

Data Configuration

Training Data

--train-data
string
Path to training data. For WebDataset, use glob patterns like /data/train-{0000..2175}.tar. Multiple sources can be combined with ::.
--train-data "/data/cc12m/train-{0000..2175}.tar"
--train-data "/data/cc12m/train.tar::/data/laion/train.tar"  # Multiple sources
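To make the pattern syntax concrete, here is a minimal sketch of how such a value breaks down into individual shard paths. The helper `expand_shards` is hypothetical and handles only one numeric `{start..end}` range per source; OpenCLIP's own loader may expand patterns differently.

```python
import re

# Hypothetical helper for illustration only; not part of OpenCLIP.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")

def expand_shards(spec: str) -> list[str]:
    """Split a --train-data value on '::' and expand one {start..end} range per source."""
    paths = []
    for source in spec.split("::"):
        match = _RANGE.search(source)
        if match is None:
            paths.append(source)  # plain path, nothing to expand
            continue
        start, end = match.group(1), match.group(2)
        width = len(start)  # preserve zero padding such as 0000
        for index in range(int(start), int(end) + 1):
            name = str(index).zfill(width)
            paths.append(source[:match.start()] + name + source[match.end():])
    return paths

shards = expand_shards("/data/cc12m/train-{0000..2175}.tar")
print(len(shards), shards[0], shards[-1])
```

Counting the expanded shards this way is also a quick sanity check that the range in the pattern matches the files actually on disk.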
--val-data
string
Path to validation data (same format as train-data).
--val-data "/data/val.csv"
--train-num-samples
integer
Total number of samples in training dataset. Required for WebDataset.
--train-num-samples 10968539  # CC12M
--val-num-samples
integer
Number of samples in validation dataset.
--dataset-type
string
default:"auto"
Dataset format: webdataset, csv, synthetic, or auto (auto-detect).
--dataset-type webdataset
--dataset-resampled
boolean
Enable sampling with replacement for webdataset. Recommended for large datasets and multiple data sources.
--dataset-resampled

CSV Data Parameters

--csv-separator
string
default:"\t"
Column separator for CSV files (tab by default).
--csv-separator ","  # Use comma separator
--csv-img-key
string
default:"filepath"
Column name for image paths in CSV.
--csv-img-key filepath
--csv-caption-key
string
default:"title"
Column name for captions in CSV.
--csv-caption-key title
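The three CSV flags map onto a separator and two column names. The sketch below reads a small in-memory file with the standard library to show how they fit together; the function `read_pairs` and the sample rows are made up for illustration, and OpenCLIP's actual CSV reader may be implemented differently.

```python
import csv
import io

# Tiny in-memory stand-in for a training CSV (file names are made up).
sample = (
    "filepath\ttitle\n"
    "images/cat.jpg\ta photo of a cat\n"
    "images/dog.jpg\ta photo of a dog\n"
)

def read_pairs(handle, separator="\t", img_key="filepath", caption_key="title"):
    """Yield (image_path, caption) tuples using the configured separator and column names."""
    reader = csv.DictReader(handle, delimiter=separator)
    for row in reader:
        yield row[img_key], row[caption_key]

pairs = list(read_pairs(io.StringIO(sample)))
print(pairs)
```

If the header row uses different column names, a `KeyError` in a sketch like this is the analogue of passing the wrong `--csv-img-key` or `--csv-caption-key`.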

Data Upsampling

--train-data-upsampling-factors
string
Upsampling factors for multiple data sources, separated by ::. Controls relative sampling probability.
--train-data "/data/cc12m/train.tar::/data/cc3m/train.tar" \
--train-data-upsampling-factors "1::4"  # Sample CC3M 4x more frequently
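One way to reason about the factors is as multipliers on each source's share of the sampling budget. The sketch below assumes that, without factors, a source is drawn from in proportion to its size, and that a factor simply scales that weight. Both the function name and the round-number source sizes are hypothetical.

```python
def source_probabilities(num_samples, factors_spec):
    """Relative chance of drawing from each source, assuming weight = size x factor."""
    factors = [float(f) for f in factors_spec.split("::")]
    if len(factors) != len(num_samples):
        raise ValueError("need one upsampling factor per data source")
    weights = [n * f for n, f in zip(num_samples, factors)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical round-number source sizes, chosen only to keep the arithmetic simple.
probs = source_probabilities([12_000_000, 3_000_000], "1::4")
print(probs)
```

Under these assumptions, a factor of 4 on a source one quarter the size of the other puts the two sources on an equal footing.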

Model Configuration

Model Selection

--model
string
default:"RN50"
Model architecture to train. See Model Architectures for all options.
--model ViT-B-32
--model ViT-L-14
--model RN50
--model coca_ViT-L-14  # CoCa model
--pretrained
string
Load pretrained weights. Can be a tag (e.g., laion2b_s34b_b79k) or a local path.
--pretrained laion2b_s34b_b79k
--pretrained /path/to/checkpoint.pt
--pretrained-image
boolean
Load ImageNet pretrained weights for the image encoder (if available).
--pretrained-image

Model Modifications

--force-image-size
integer
Override default image input size.
--force-image-size 224
--force-image-size 336 336  # Explicit height and width
--force-context-length
integer
Override default text context length.
--force-context-length 77
--force-patch-dropout
float
Override patch dropout probability for ViT models. Use 0.5-0.75 for 2-3x speedup.
--force-patch-dropout 0.5  # 50% patch dropout
--force-patch-dropout 0.0  # Disable patch dropout (fine-tuning)
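The speedup comes from the transformer processing fewer image tokens per sample. As a rough sketch, the number of patch tokens kept can be estimated as below; the rounding rule and the omission of the class token are assumptions, and `tokens_kept` is a hypothetical helper.

```python
def tokens_kept(image_size, patch_size, dropout):
    """Return (total patch tokens, tokens kept) for a square image and square patches."""
    num_patches = (image_size // patch_size) ** 2
    kept = max(1, int(num_patches * (1.0 - dropout)))  # always keep at least one token
    return num_patches, kept

for model, patch in (("ViT-B-32", 32), ("ViT-L-14", 14)):
    total, kept = tokens_kept(224, patch, 0.5)
    print(f"{model}: {kept} of {total} patch tokens kept")
```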
--force-quick-gelu
boolean
Force QuickGELU activation (for compatibility with older checkpoints).
--force-custom-text
boolean
Force separate text tower (CustomTextCLIP architecture).

Training Hyperparameters

Batch Size and Epochs

--batch-size
integer
default:"64"
Batch size per GPU. Total batch size = batch_size × num_gpus × accum_freq.
--batch-size 256
--epochs
integer
default:"32"
Number of training epochs.
--epochs 32
--accum-freq
integer
default:"1"
Gradient accumulation frequency. Simulates larger batch sizes.
--accum-freq 4  # Effective batch = batch_size × 4
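The total batch size formula above is easy to check before launching a run. This small sketch just multiplies the three terms; the function name is illustrative.

```python
def effective_batch_size(batch_size, num_gpus, accum_freq=1):
    """Total samples per optimizer step: per-GPU batch x GPUs x accumulation steps."""
    return batch_size * num_gpus * accum_freq

# 256 per GPU on 4 GPUs with --accum-freq 4
print(effective_batch_size(256, 4, 4))
```

Because the learning rate and warmup settings are usually tuned against the total batch size, it helps to recompute this number whenever the GPU count changes.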

Learning Rate

--lr
float
Learning rate. The default is 5e-4 for both ViT and ResNet models.
--lr 1e-3
--lr 5e-4
--warmup
integer
default:"10000"
Number of warmup steps (linear warmup from 0 to lr).
--warmup 10000
--lr-scheduler
string
default:"cosine"
Learning rate schedule: cosine, const, or const-cooldown.
--lr-scheduler cosine
--lr-scheduler const  # Constant LR after warmup
--lr-scheduler const-cooldown  # Constant with cooldown
--epochs-cooldown
integer
Number of cooldown epochs for const-cooldown scheduler.
--lr-scheduler const-cooldown \
--epochs-cooldown 5
--lr-cooldown-end
float
default:"0.0"
End learning rate for cooldown.
--lr-cooldown-end 1e-6
--lr-cooldown-power
float
default:"1.0"
Power for polynomial cooldown (1.0 = linear).
--lr-cooldown-power 1.0
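To show how the warmup, scheduler, and cooldown flags interact, here is a minimal sketch of the two schedule shapes as functions of the step index. It is written from the flag descriptions above, not copied from OpenCLIP. The exact warmup offset is an assumption, and the cooldown length is given in steps here, whereas `--epochs-cooldown` is specified in epochs.

```python
import math

def warmup_lr(base_lr, warmup_steps, step):
    """Linear ramp toward base_lr over the first warmup_steps updates."""
    return base_lr * (step + 1) / warmup_steps

def cosine_lr(base_lr, warmup_steps, total_steps, step):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return warmup_lr(base_lr, warmup_steps, step)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr

def const_cooldown_lr(base_lr, warmup_steps, total_steps, cooldown_steps, step,
                      cooldown_end=0.0, cooldown_power=1.0):
    """Linear warmup, constant plateau, then polynomial decay to cooldown_end."""
    if step < warmup_steps:
        return warmup_lr(base_lr, warmup_steps, step)
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return base_lr
    remaining = 1.0 - (step - cooldown_start) / cooldown_steps
    return cooldown_end + (base_lr - cooldown_end) * remaining ** cooldown_power

for step in (0, 10_000, 55_000, 95_000):
    print(step,
          cosine_lr(5e-4, 10_000, 100_000, step),
          const_cooldown_lr(5e-4, 10_000, 100_000, 10_000, step))
```

With `cooldown_power=1.0` the decay is linear; larger values fall off faster at the start of the cooldown.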

Optimizer

--opt
string
default:"adamw"
Optimizer choice. Use adamw or timm/{optimizer} for timm optimizers.
--opt adamw
--opt timm/sgd
--beta1
float
Adam beta1 parameter. Default:
  • ViT: 0.9
  • ResNet: 0.9
--beta1 0.9
--beta2
float
Adam beta2 parameter. Default:
  • ViT: 0.98
  • ResNet: 0.999
--beta2 0.98
--eps
float
Adam epsilon parameter. Default:
  • ViT: 1e-6
  • ResNet: 1e-8
--eps 1e-6
--wd
float
default:"0.2"
Weight decay. With the default adamw optimizer this is decoupled weight decay rather than an L2 penalty added to the loss.
--wd 0.2
--wd 0.1
--momentum
float
Momentum for timm optimizers (SGD, etc.).
--momentum 0.9
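The per-model defaults listed for `--lr`, `--beta1`, `--beta2`, and `--eps` can be summarized as a lookup with explicit flags taking precedence. The sketch below encodes the values from this page. Matching on the substring "vit" and routing every other model name to the ResNet values are assumptions, and both function names are hypothetical.

```python
def default_optimizer_params(model_name):
    """Defaults from this page; the 'vit' substring test is an assumed selection rule."""
    if "vit" in model_name.lower():
        return {"lr": 5e-4, "beta1": 0.9, "beta2": 0.98, "eps": 1e-6}
    return {"lr": 5e-4, "beta1": 0.9, "beta2": 0.999, "eps": 1e-8}

def resolve_optimizer_params(model_name, **overrides):
    """Start from the model defaults and apply any flag the user set explicitly."""
    params = default_optimizer_params(model_name)
    params.update({key: value for key, value in overrides.items() if value is not None})
    return params

print(resolve_optimizer_params("ViT-B-32"))
print(resolve_optimizer_params("RN50", lr=1e-3))
```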

Gradient Clipping

--grad-clip-norm
float
Gradient clipping norm. Prevents gradient explosion.
--grad-clip-norm 1.0
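Norm-based clipping rescales all gradients by one common factor when their joint L2 norm exceeds the threshold. Here is a dependency-free sketch of that rule on a flat list of numbers. In a real PyTorch training loop this is typically delegated to `torch.nn.utils.clip_grad_norm_`, and whether OpenCLIP calls it in exactly this way is not shown here.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Gradients already inside the threshold pass through unchanged, so the flag only matters during unstable steps.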

Precision and Memory

Precision

--precision
string
default:"amp"
Training precision: amp, amp_bf16, bf16, fp16, fp32.
--precision amp        # Automatic Mixed Precision (FP16) - Recommended
--precision amp_bf16   # AMP with BFloat16 (A100/H100)
--precision fp32       # Full precision (slow, baseline)

Memory Optimization

--grad-checkpointing
boolean
Enable gradient checkpointing to reduce memory usage (slower training).
--grad-checkpointing
--local-loss
boolean
Calculate loss with local features @ global (reduces memory from O(n²) to O(n)).
--local-loss
--gather-with-grad
boolean
Enable gradient flow through feature gathering (use with --local-loss).
--gather-with-grad
Always use --local-loss and --gather-with-grad together for multi-GPU training (8+ GPUs). See Distributed Training.
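The memory saving can be made concrete by counting entries in the similarity matrix each GPU builds. The sketch below assumes every GPU holds the gathered features of the full global batch. Without `--local-loss` each GPU scores global against global, and with it each GPU scores only its local batch against global. The function name is hypothetical.

```python
def logit_matrix_entries(per_gpu_batch, world_size, local_loss):
    """Number of image-text similarity entries materialized on one GPU."""
    global_batch = per_gpu_batch * world_size
    rows = per_gpu_batch if local_loss else global_batch
    return rows * global_batch

full = logit_matrix_entries(256, 8, local_loss=False)
local = logit_matrix_entries(256, 8, local_loss=True)
print(full, local, full // local)
```

For a fixed per-GPU batch, the local variant grows linearly with the number of GPUs instead of quadratically, which is where the O(n²) to O(n) description comes from.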

Data Loading

--workers
integer
default:"4"
Number of data loading workers per GPU.
--workers 8  # 8 workers per GPU
Recommended: 4-8 workers per GPU for optimal performance.

Image Preprocessing

--image-mean
float[]
Override image normalization mean (RGB).
--image-mean 0.485 0.456 0.406  # ImageNet statistics
--image-std
float[]
Override image normalization std (RGB).
--image-std 0.229 0.224 0.225  # ImageNet statistics
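The mean and std values are applied per channel as (value - mean) / std. A minimal sketch on a single RGB pixel, assuming channel values are already scaled to the [0, 1] range; the helper name is illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))

print(normalize_pixel((1.0, 1.0, 1.0)))
```

Overriding these flags only makes sense when they match the statistics the pretrained weights were trained with.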
--image-interpolation
string
Image resize interpolation: bicubic, bilinear, or random.
--image-interpolation bicubic
--image-resize-mode
string
Image resize mode: shortest, longest, or squash (inference only).
--image-resize-mode shortest
--aug-cfg
key=value
Data augmentation configuration (key-value pairs).
--aug-cfg scale_range=0.08::1.0 ratio_range=0.75::1.33

Model Locking (Transfer Learning)

Image Tower

--lock-image
boolean
Lock (freeze) entire image encoder.
--lock-image
--lock-image-unlocked-groups
integer
default:"0"
Leave last N image tower layer groups unlocked.
--lock-image --lock-image-unlocked-groups 2  # Freeze all but last 2 groups
--lock-image-freeze-bn-stats
boolean
Freeze BatchNorm running statistics in locked layers.
--lock-image-freeze-bn-stats

Text Tower

--lock-text
boolean
Lock (freeze) entire text encoder.
--lock-text
--lock-text-unlocked-layers
integer
default:"0"
Leave last N text tower layers unlocked.
--lock-text --lock-text-unlocked-layers 10  # Train last 10 layers
--lock-text-freeze-layer-norm
boolean
Freeze LayerNorm in locked text layers.
--lock-text-freeze-layer-norm

Checkpointing and Logging

Checkpoints

--save-frequency
integer
default:"1"
Save checkpoint every N epochs.
--save-frequency 1  # Save every epoch
--save-frequency 5  # Save every 5 epochs
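To preview which epochs will leave a checkpoint on disk, the rule can be sketched as below. Saving the final epoch even when it is off schedule is an assumption here, not something this page states, and `checkpoint_epochs` is a hypothetical helper.

```python
def checkpoint_epochs(total_epochs, save_frequency):
    """Epochs (1-based) after which a checkpoint would be written."""
    saved = []
    for epoch in range(1, total_epochs + 1):
        is_last = epoch == total_epochs  # assumed: the last epoch is always saved
        on_schedule = save_frequency > 0 and epoch % save_frequency == 0
        if is_last or on_schedule:
            saved.append(epoch)
    return saved

print(checkpoint_epochs(32, 5))
```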
--save-most-recent
boolean
Save most recent checkpoint as epoch_latest.pt.
--save-most-recent
--delete-previous-checkpoint
boolean
Delete previous checkpoint after saving new one (saves disk space).
--delete-previous-checkpoint
--resume
string
Resume training from checkpoint path or “latest”.
--resume /path/to/checkpoint.pt
--resume latest  # Resume from latest checkpoint

Logging

--logs
string
default:"./logs/"
Directory for logs and checkpoints.
--logs ./logs/
--name
string
Experiment name (defaults to auto-generated based on timestamp and config).
--name "vit-b32-cc12m-experiment"
--report-to
string
Logging backends: tensorboard, wandb, or tensorboard,wandb.
--report-to tensorboard
--report-to wandb
--report-to tensorboard,wandb  # Both
--log-every-n-steps
integer
default:"100"
Log training metrics every N steps.
--log-every-n-steps 100

Weights & Biases

--wandb-project-name
string
default:"open-clip"
W&B project name.
--wandb-project-name "my-clip-experiments"
--wandb-notes
string
Notes for W&B run.
--wandb-notes "Testing new learning rate schedule"

Evaluation

--imagenet-val
string
Path to ImageNet validation set for zero-shot evaluation during training.
--imagenet-val /data/imagenet/validation/
--imagenet-v2
string
Path to ImageNet-v2 for additional zero-shot evaluation.
--imagenet-v2 /data/imagenet-v2/
--zeroshot-frequency
integer
default:"2"
Run zero-shot evaluation every N epochs.
--zeroshot-frequency 1  # Every epoch
--val-frequency
integer
default:"1"
Run validation every N epochs.
--val-frequency 1

CoCa-Specific Parameters

--coca-contrastive-loss-weight
float
default:"1.0"
Weight for CoCa contrastive loss.
--coca-contrastive-loss-weight 1.0
--coca-caption-loss-weight
float
default:"2.0"
Weight for CoCa caption generation loss.
--coca-caption-loss-weight 2.0
For CoCa fine-tuning on captioning only:
--coca-contrastive-loss-weight 0 \
--coca-caption-loss-weight 1
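The two weights combine the loss terms as a weighted sum. The sketch below shows that arithmetic with made-up loss values; the function name is illustrative, and the real training loop operates on tensors, not floats.

```python
def coca_total_loss(contrastive_loss, caption_loss,
                    contrastive_weight=1.0, caption_weight=2.0):
    """Weighted sum of the contrastive and captioning terms."""
    return contrastive_weight * contrastive_loss + caption_weight * caption_loss

# Made-up loss values, purely to show the arithmetic.
print(coca_total_loss(0.5, 1.0))          # default weights
print(coca_total_loss(0.5, 1.0, 0, 1))    # captioning-only fine-tuning
```

Setting the contrastive weight to 0 removes that term from the objective entirely, which is why the captioning-only configuration above pairs it with a caption weight of 1.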

Distributed Training

--dist-url
string
URL for distributed training initialization.
--dist-url tcp://localhost:12345
--dist-backend
string
Distributed backend: nccl (NVIDIA GPU), hccl (Ascend NPU), or gloo (CPU).
--dist-backend nccl  # Default for GPU
--horovod
boolean
Use Horovod for distributed training.
--horovod
--ddp-static-graph
boolean
Enable static graph optimization for DDP (PyTorch >= 1.11).
--ddp-static-graph
--use-bn-sync
boolean
Use synchronized batch normalization across GPUs.
--use-bn-sync

Advanced Options

Compilation

--torchcompile
boolean
Compile model with torch.compile() (PyTorch >= 2.0).
--torchcompile
--torchscript
boolean
TorchScript the model.
--torchscript
--trace
boolean
Trace model with torch.jit.trace (inference only).
--trace

Model Distillation

--distill-model
string
Teacher model architecture for distillation.
--distill-model ViT-L-14
--distill-pretrained
string
Teacher model pretrained weights.
--distill-pretrained openai

Loss Configuration

--siglip
boolean
Use SigLIP (sigmoid) loss instead of standard CLIP loss.
--siglip
--loss-dist-impl
string
Distributed loss implementation override.
--loss-dist-impl custom

Remote Syncing

--remote-sync
string
Remote path to sync checkpoints (S3 bucket or filesystem).
--remote-sync s3://my-bucket/checkpoints
--remote-sync-frequency
integer
default:"300"
Sync to remote every N seconds.
--remote-sync-frequency 600  # Sync every 10 minutes
--remote-sync-protocol
string
default:"s3"
Protocol for remote sync: s3 or fsspec.
--remote-sync-protocol s3

Experimental

--use-bnb-linear
string
Use bitsandbytes linear layers for int8 training (experimental).
--use-bnb-linear SwitchBackLinearGlobal

Other

--seed
integer
default:"0"
Random seed for reproducibility.
--seed 42
--device
string
default:"cuda"
Device for training: cuda or cpu.
--device cuda
--cache-dir
string
Override default cache directory for model/tokenizer downloads.
--cache-dir /path/to/cache
--debug
boolean
Enable debug logging.
--debug
--log-local
boolean
Log on local master (each node) instead of global master only.
--log-local
--copy-codebase
boolean
Copy entire codebase to log directory.
--copy-codebase

Example Configurations

Small-Scale Training (RN50 on CC3M)

python -m open_clip_train.main \
    --train-data "/data/cc3m/train.csv" \
    --dataset-type csv \
    --csv-img-key filepath \
    --csv-caption-key title \
    --batch-size 256 \
    --precision amp \
    --workers 4 \
    --warmup 2000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 30 \
    --model RN50 \
    --save-frequency 5 \
    --report-to tensorboard

Medium-Scale Training (ViT-B/32 on CC12M)

torchrun --nproc_per_node 4 -m open_clip_train.main \
    --train-data "/data/cc12m/cc12m-{0000..2175}.tar" \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 320 \
    --precision amp \
    --workers 6 \
    --imagenet-val /data/imagenet/validation/ \
    --warmup 10000 \
    --lr 1e-3 \
    --wd 0.1 \
    --epochs 32 \
    --model ViT-B-32 \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --local-loss \
    --gather-with-grad \
    --report-to wandb

Large-Scale Training (ViT-L/14 on LAION-400M)

srun python -u src/open_clip_train/main.py \
    --train-data "/data/laion400m/{00000..41455}.tar" \
    --train-num-samples 400000000 \
    --dataset-type webdataset \
    --dataset-resampled \
    --batch-size 128 \
    --precision amp \
    --grad-checkpointing \
    --workers 8 \
    --warmup 10000 \
    --lr 5e-4 \
    --wd 0.2 \
    --epochs 32 \
    --model ViT-L-14 \
    --save-frequency 1 \
    --zeroshot-frequency 2 \
    --local-loss \
    --gather-with-grad \
    --force-patch-dropout 0.5 \
    --report-to wandb \
    --remote-sync s3://bucket/checkpoints \
    --delete-previous-checkpoint

ViT-B/32

# Suggested per-GPU batch size: 256-512 (256 shown)
--model ViT-B-32 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--batch-size 256 \
--precision amp

ViT-L/14

# Suggested per-GPU batch size: 128-256 (128 shown)
--model ViT-L-14 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.98 \
--eps 1e-6 \
--batch-size 128 \
--precision amp \
--grad-checkpointing \
--force-patch-dropout 0.5

RN50

# Suggested per-GPU batch size: 256-512 (256 shown)
--model RN50 \
--lr 5e-4 \
--beta1 0.9 \
--beta2 0.999 \
--eps 1e-8 \
--batch-size 256 \
--precision amp

Next Steps

Single-Node Training

Apply these configurations to single-node training

Distributed Training

Configure distributed training optimizations

Data Preparation

Configure data loading and preprocessing

Fine-tuning

Configure fine-tuning from pretrained models